Bike Rental

Background

test

Setup & Data Load

# load packages
library(readxl)
library(ggplot2)
library(lubridate)
library(here)
library(scales)
library(dplyr)
library(plotly)
require(webshot)
require(knitr)


# Setting locale
Sys.setlocale("LC_ALL", 'German_Switzerland')
## [1] "LC_COLLATE=German_Switzerland.1252;LC_CTYPE=German_Switzerland.1252;LC_MONETARY=German_Switzerland.1252;LC_NUMERIC=C;LC_TIME=German_Switzerland.1252"
# Set working directory for R and knitr
setwd(here('Group_Work'))
opts_knit$set(root.dir = here('Group_Work'))

# Get rental data
bikedata.source <- read.csv2('./source/bikesharing/SeoulBikeData.csv', sep=',')
str(bikedata.source)
## 'data.frame':    8760 obs. of  14 variables:
##  $ Date                     : chr  "01/12/2017" "01/12/2017" "01/12/2017" "01/12/2017" ...
##  $ Rented.Bike.Count        : int  254 204 173 107 78 100 181 460 930 490 ...
##  $ Hour                     : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Temperature..C.          : chr  "-5.2" "-5.5" "-6" "-6.2" ...
##  $ Humidity...              : int  37 38 39 40 36 37 35 38 37 27 ...
##  $ Wind.speed..m.s.         : chr  "2.2" "0.8" "1" "0.9" ...
##  $ Visibility..10m.         : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 1928 ...
##  $ Dew.point.temperature..C.: chr  "-17.6" "-17.6" "-17.7" "-17.6" ...
##  $ Solar.Radiation..MJ.m2.  : chr  "0" "0" "0" "0" ...
##  $ Rainfall.mm.             : chr  "0" "0" "0" "0" ...
##  $ Snowfall..cm.            : chr  "0" "0" "0" "0" ...
##  $ Seasons                  : chr  "Winter" "Winter" "Winter" "Winter" ...
##  $ Holiday                  : chr  "No Holiday" "No Holiday" "No Holiday" "No Holiday" ...
##  $ Functioning.Day          : chr  "Yes" "Yes" "Yes" "Yes" ...

As the output above shows, the bike rental data from Seoul (source: https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand) contains over 8700 observations and 14 variables or attributes. This includes hourly number of bike rentals including meteorological information - such as humidity, temperature, rainfall etc. - as well as ‘holiday’ indicators. R tried to automatically determine the data types of our attributes but mis-categorized some. All attributes will be cast to their required data type in the next step (Data Transformation).

# Get airport arrival data 2017 & 2018
air17.source <- read_excel('./source/bikesharing/Airport_Statistics_Stat_by_Time_of_Day_201701_201712_edited.xls', sheet = 1)
air18.source <- read_excel('./source/bikesharing/Airport_Statistics_Stat_by_Time_of_Day_201801_201812_edited.xls', sheet = 1)
str(air17.source) #str(air18.source) # same as air17 but for year 2018
## tibble [25 x 4] (S3: tbl_df/tbl/data.frame)
##  $ time     : chr [1:25] "00:00 ~ 00:59" "01:00 ~ 01:59" "02:00 ~ 02:59" "03:00 ~ 03:59" ...
##  $ Arrival  : chr [1:25] "109,770" "28,682" "46,792" "179,620" ...
##  $ Departure: chr [1:25] "498,707" "254,771" "64,177" "18,400" ...
##  $ Total    : chr [1:25] "608,477" "283,453" "110,969" "198,020" ...

In addition to the above bike rental data set, we will employ data from the Incheon Airport Seoul (source: https://www.airport.kr/co/en/cpr/statisticCategoryOfTime.do) which lists the total number of passengers (arrivals) for 2017 and 2018, grouped by hour of the day. This grouping was chosen due to the fact that the available bike data is also grouped on an hourly basis.

Data Transformation

In order to start analysing and building/validating models on this data, the available files and attributes need to be transformed into a suitable, standardise and combined form. Data transformation includes the following activities, depending on the specific dataset(s):

Let us start with the rental bike dataset. Below we will cast the attributes to their most appropriate type. Potential issues with missing / invalid data should be revealed during that process.

Missing Values

Before we continue to the further preparation steps, now would be a good time to check for missing values and handle them appropriately (before merging all datasets into one final dataframe). We use is.na() and a search for empty strings to identify potentially missing values. If feasible and necessary, further imputations will be made to replace those missing values. The output of the various function calls below is suppressed but did not show any missing values or empty strings.

# bike data set
bikedata.source[is.na(bikedata.source),]
# ==> no NA values!
bikedata.source[bikedata.source == '',]
# ==> no empty strings!

# air 2017 dataset
air17.source[is.na(air17.source)]
# ==> no NA values!
air17.source == ''
# ==> no empty strings!

# air 2018 dataset
is.na(air18.source)
# ==> no NA values!
air18.source == ''
# ==> no empty strings!

Casting of Data Types

First, let’s cast the attributes of bikedata into their correct data types:

# creating cleaned version of source data with correctly cast data types for each attribute
bikedata.c <- data.frame(
    Date =          as.Date(bikedata.source$Date, format='%d/%m/%Y'),
    Year =          year(as.Date(bikedata.source$Date, format='%d/%m/%Y')),
    RentCount =     bikedata.source$Rented.Bike.Count,
    Hour =          bikedata.source$Hour,
    Temperature =   as.double(bikedata.source$Temperature..C., length = 1),
    Humidity =      bikedata.source$Humidity...,
    Windspeed =     as.double(bikedata.source$Wind.speed..m.s., length = 1),
    Visibility =    bikedata.source$Visibility..10m.,
    DewPointTemp =  as.double(bikedata.source$Dew.point.temperature..C., length = 2),
    SolarRadiation= as.double(bikedata.source$Solar.Radiation..MJ.m2., length = 1),
    Rainfall =      as.double(bikedata.source$Rainfall.mm., length = 1),
    Snowfall =      as.double(bikedata.source$Snowfall..cm., length = 1),
    Season =        factor(bikedata.source$Seasons),
    Holiday =       factor(bikedata.source$Holiday),
    Functional =    factor(bikedata.source$Functioning.Day)
)
# check
str(bikedata.c)
## 'data.frame':    8760 obs. of  15 variables:
##  $ Date          : Date, format: "2017-12-01" "2017-12-01" ...
##  $ Year          : num  2017 2017 2017 2017 2017 ...
##  $ RentCount     : int  254 204 173 107 78 100 181 460 930 490 ...
##  $ Hour          : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Temperature   : num  -5.2 -5.5 -6 -6.2 -6 -6.4 -6.6 -7.4 -7.6 -6.5 ...
##  $ Humidity      : int  37 38 39 40 36 37 35 38 37 27 ...
##  $ Windspeed     : num  2.2 0.8 1 0.9 2.3 1.5 1.3 0.9 1.1 0.5 ...
##  $ Visibility    : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 1928 ...
##  $ DewPointTemp  : num  -17.6 -17.6 -17.7 -17.6 -18.6 -18.7 -19.5 -19.3 -19.8 -22.4 ...
##  $ SolarRadiation: num  0 0 0 0 0 0 0 0 0.01 0.23 ...
##  $ Rainfall      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Snowfall      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Season        : Factor w/ 4 levels "Autumn","Spring",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Holiday       : Factor w/ 2 levels "Holiday","No Holiday": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Functional    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...

The casting of data types did not show any problems with invalid data / missing values so no removal / cleanup of that dataset seems to be necessary at this point. In the next step we will continue with the two air passenger datasets for Incheon Airport. This dataset is split by year. After casting all attributes to appropriate data types, the two data sets (2017 and 2018) will be combined into one dataframe with the addition of a year attribute.

# before going further, we need to remove the 'TOTAL' row from the data set
air17.edit <- air17.source[air17.source$time != 'Total',]
air18.edit <- air18.source[air18.source$time != 'Total',]

# 2017 air passenger data Seoul
air17.c <- data.frame(
    Year =          2017,
    Hour =          as.integer(substring(air17.edit$time, 1, 2)),
    Arrival =       as.integer(gsub(',', '', air17.edit$Arrival)),
    Departure =     as.integer(gsub(',', '', air17.edit$Departure)),
    TotalArrival =  as.integer(gsub(',', '', air17.source$Arrival[length(air17.source$Arrival)])), # get total arrivals for 2017
    TotalDeparture= as.integer(gsub(',', '', air17.source$Departure[length(air17.source$Departure)])) # get total departures for 2017
)
# 2018 air passenger data Seoul
air18.c <- data.frame(
    Year =          2018,
    Hour =          as.integer(substring(air18.edit$time, 1, 2)),
    Arrival =       as.integer(gsub(',', '', air18.edit$Arrival)),
    Departure =     as.integer(gsub(',', '', air18.edit$Departure)),
    TotalArrival =  as.integer(gsub(',', '', air18.source$Arrival[length(air18.source$Arrival)])), # get total arrivals for 2018
    TotalDeparture= as.integer(gsub(',', '', air18.source$Departure[length(air18.source$Departure)])) # get total departures for 2018
)

Data Merge

As a last step in the transformation phase, the currently separate datasets air.17, air.18 and bikedata will be merged together to create one single data frame to be used for the further analysis. In a first step, the two *air** datasets will be merged, before then joining them to the bikedata dataset.

# combine air17 and air 18 into air.c
air.c <- rbind(air17.c, air18.c)
# adding relative calculations (percentage of total yearly arrivals/departures per hour)
ArrivalPct <- air.c$Arrival / air.c$TotalArrival
DeparturePct <- air.c$Departure / air.c$TotalDeparture
air.c <- cbind(air.c, ArrivalPct, DeparturePct)
# result check
head(air.c)

The air passenger data sets required a bit more attribute engineering. We first had to remove the TOTAL row from both files (2017 & 2018). After that, we added a Year attribute to enable us to latter merge the two data sets. We then isolated a substring of the time description (used to be in the form of “00:00 ~ 00:59”) to standardise the values to the same values found in the bike sharing data set ( integers from 0 to 23 ). We then had to replace “,” characters found in the total Arrival and Departure counts before being able to cast them to integer. Further, two new attributes TotalArrival and TotalDeparture were added to both the 2017 and 2018 data which hold the total arrivals and departures per year respectively. This information will be used to calculate the percentage of arrivals/departures per hour compared to the total arrivals/departures per year. The last step combined the two year 2017 and 2018 into one data frame air.c.

The next step will merge the air passenger information with the bike sharing data to create one data frame as the basis for further analysis and modelling.

The last step will then serialise the resulting object (the air.c data frame) so that the data preparation steps do not have to be run everytime the analysis and/or the R markdown file are run. This allows us to simply load the created and exported data frame via the readRDS() function before starting with the analysis.

# merging the two data frames based on 'Year' and 'Hour)'
data.m <- merge(bikedata.c, air.c, by = c('Year','Hour'), all = TRUE)
# sorting the result by 'Date' and 'Hour'
data.m <- data.m[order(data.m$Date, data.m$Hour),]
# persist resulting (basis) dataframe for further analysis via serialisation
saveRDS(data.m, file='./db/data.RDS')

Exploratory Analysis

setwd(here('Group_Work'))
data <- readRDS('./db/data.RDS')
head(data)

Let us now create various graphs to get insights to the data. We are interested to see how the bike rental count developed in the months from end of 2017 to end of 2018.

#head(data)

data$Month <- as.Date(cut(data$Date,
  breaks = "month"))

ggplot(data = data, mapping = aes(x=Month, y=RentCount))+
  stat_summary(fun = sum, geom = "line", color="blue")+
  labs(title="Bike Rental Monthly", y="Bike Rentals")+
  scale_x_date(labels =  date_format("%m/%y"), breaks = "1 month")

Clearly a seasonality is visible in the data. Let’s compare the different seasons by creating boxplots for each season.

#This shows the hourly data not daily
ggplot(data = data, mapping = aes(x=Season, y = RentCount)) +
  geom_boxplot()+
  labs(y = "Bike Rentals")

As expected a seasonality of bike rentals is visible. More bike rentals occur during the summer months compared to the winter. Autumn and spring show similar distribution.

Let’s see how the rental count behaves during a day. Are certain hours preferred for renting a bike? In order to get a first answer to the question we will average over the hours.

ggplot(data = data, mapping = aes(x=as.factor(Hour), y=RentCount))+
  stat_summary(fun="mean", geom="bar")+
  labs(y = "Bike Rentals", x = "Hour")

We assume that the temperature influences the number of bikes rented. Let’s create a scatterplot comparing those two variables.

#Plot the hourly count values based on recorded temperatures (in °C)
ggplot(data = data, mapping = aes(y = RentCount, x = Temperature)) +
  geom_point()+
  geom_smooth(se = FALSE)

The resulting plot shows an interesting relationship between temperature and the hourly rental count:

This seems inherently explainable since people are less likely to use a bicycle in full summer heat and to prefer alternative modes of transport (preferably with A/C).

Linear Modelling

Implementing first (simple) Model

As we are looking at count data it makes sense to use a generalised linear model (GLM) in order to specify the distribution of our data as a poisson and apply the appropriate link function (logarithm). In a first, simple model we will look into the effect of the season categorical variable (factor in R) onto the target variable rental count.

glm.bike <- glm(RentCount ~ Season,
                family = "poisson",
                data = data)

glm.bike
## 
## Call:  glm(formula = RentCount ~ Season, family = "poisson", data = data)
## 
## Coefficients:
##  (Intercept)  SeasonSpring  SeasonSummer  SeasonWinter  
##       6.7088       -0.1157        0.2324       -1.2903  
## 
## Degrees of Freedom: 8759 Total (i.e. Null);  8756 Residual
## Null Deviance:       4979000 
## Residual Deviance: 3682000   AIC: 3749000
# Absolute count numbers per model
paste(  'Intercept (Autumn) = ', exp(coef(glm.bike))[1],
        '\nSpring Absolute Count = ', round(exp(coef(glm.bike))[1] * exp(coef(glm.bike))[2],0),
        '\nSummer Absolute Count = ', round(exp(coef(glm.bike))[1] * exp(coef(glm.bike))[3], 0),
        '\nWinter Absolute Count = ', round(exp(coef(glm.bike))[1] * exp(coef(glm.bike))[4], 0)
)
## [1] "Intercept (Autumn) =  819.597985348541 \nSpring Absolute Count =  730 \nSummer Absolute Count =  1034 \nWinter Absolute Count =  226"

As suspected when investigating the boxplots based on season, temperature obviously has a significant impact on the bike rental count. However, as also evident in these results, there seems to be only a relatively minor difference between autumn, spring and summer (\(819\), \(0.9 * 819 = 730\), \(1.26 * 810 = 1034\)). The only relatively large differences occurs between these three seasons and winter (\(0.28 * 819 = 226\)).

Model Performance

In order to be able to compare the model performance of our first simple model to later, improved versions we will look into model performance by analysing the mean squared error(MSE) of our current model.

Let’s start by visualising the residuals (errors) per sample point. After that the MSE will be calculated.

plot(resid(glm.bike))

# calculating the MSE using the residuals (errors) of the simple model
glm.bike.mse <- mean((data$RentCount-fitted(glm.bike))^2)

#data.frame(fitted(glm.bike), data$RentCount, fitted(glm.bike) - data$RentCount)
glm.bike.mse
## [1] 328564.5

The plot of residuals shows a few interesting characteristics:

  • the error for the first ~2000 samples seems to be significantly smaller than for the remaining 6000 samples
  • the model tends to underestimate the actual value since residuals are largely positive and have a larger positive extreme (around 60) compared to the negative extreme (around -40)

The calculated MSE is 328564.5 (\(MSE = 1/n * \sum(Y-\hat{Y})^2 = 328564.5\))

In order to get a more reliable measure of fit, we will further compute the test set MSE using 10-fold cross-validation in the next part:

# split dataset into 10 parts
ids <- sample(1:10, nrow(data), replace=TRUE  )
df.test <- data.frame(data, id = round(ids,0))

# fit model to each training set and compute test MSE
mse <- c()
for(i in 1:10){
  fit.updated <- update(glm.bike,
                         data = df.test[df.test['id']!=i,])
  # predict using test set
  fit.test <- predict(fit.updated, newdata = df.test[df.test['id']==i,])
  # store results in df
  pred.test <- data.frame( pred = exp(fit.test), obs = df.test[df.test['id']==i,]$RentCount)

  # compute MSE
  mse[i] <- mean((pred.test$obs - pred.test$pred)^2)
}
mean(mse)
## [1] 328652.7

This yields a 10-fold cross validated testing MSE of 328751.4, slightly better than the in-sample / training MSE.

Extended Modelling

[CHAPTER OF CHOICE] - Exploring plotly

NOTE: The following plots are meant to be interactive and are therefore best enjoyed in the .HTML markdown version!!

For the chapter of choice - with the aim to use a package not mentioned in the course - we will explore the plotly library. This library allows to embed advanced, interactive charts in markdown by using the plotly.js JavaScript library. The documentation for this library can be found here: https://plotly.com/r/

Lets start with a simple, multi-dimensional plot using the plot_ly() function. Per default, plotly graphs allow for interactive selections and filtering, which is very nice!

#library(plotly) #install.packages('plotly')

fig <- plot_ly(data=data,
        x=~Temperature,
        y=~Hour,
        z=~RentCount,
        type="scatter3d", mode="markers")
fig

That does not seem too bad! We have quickly created an interactive 3D plot which allows us to change the perspective and hover over single data points. However, the plot area is rather small and there are many data points to be displayed. In the current form, its hard to distinguish the separate points. Let’s see if we can adjust the plot area accordingly and create more room for the 3D plot to shine. In order to show case interactive capabilities of plotly, we will further add a categorical variable and use this to color the different data points. By doing so, plotly will automatically add a filter and allow to select different levels of the factor variable.

fig2 <- plot_ly(data=data,
        x=~Temperature,
        y=~Hour,
        z=~RentCount,
        type="scatter3d", mode="markers", color=~Season #here, we added the color argument and set it to the 'Season' factor
)
# in order to adjust the plot area, we have to adjust the width, height and margins of the graph as well as deactivate autosize
fig2 %>% layout(autosize = FALSE, width = 1000, height = 600, margin = list(l=50, r=10, b=10, t=10, pad=1))









Now this seems a lot better! In this version, we are able to select/deselect one ore several seasons and only show the necessary observations belonging to those seasons. The graph also has more room so the single observations become apparent and we can see all axis and their labels.